ABSTRACT



An auxiliary service unit is normally idle, or in cold
standby. If a demand for the unit's service occurs, the unit
must be available to satisfy it, or else a "catastrophe" occurs.
Policies for periodic inspection and maintenance of such a unit
are derived in this paper that maximize the expected time until a 
catastrophe occurs. The policies recognize that inspection, 
maintenance, and repair periods are of non-zero duration, during 
which the unit is vulnerable. They also account for the possi- 
bility of hazardous inspection that may damage the unit, and 
various forms of imperfect repair. 

Important examples occur in the nuclear power industry: a
unit may be a pump or an emergency diesel generator, and a demand
may be caused by an initiating event such as a pipe break or loss
of off-site power; "catastrophe" equates to a loss-of-coolant
accident or meltdown. Other examples occur in the military, and in
emergency services to hospitals.



Key words: Reliability, availability, maintenance, time to
failure, inspection, Markov decision process, nuclear
safety, standby redundancy.

INTRODUCTION



It is common practice to improve the reliability of a system 
by installing cold standby units, which are only brought into 
operation when a standard operating system fails. In particu- 
lar, diesel generators in cold standby may be used to scram a 
reactor in case of a coolant pipe breaking or some other failure 
in a nuclear power plant. Other examples occur in hospital 
power supplies and military hardware. If such a standby system 
fails to operate when it is required, then the consequences could 
be catastrophic. The times when there is a need for the standby 
unit are called initiating events. If the standby system is in 
a failed state, when an initiating event occurs, then a catas- 
trophic event is said to occur. 

It is necessary to inspect and maintain the standby system 
from time to time. If inspection reveals it to be in an unsatis- 
factory state, repairs are made. The idea is that the standby 
unit can go down even when it is not operating and this will 
cause it to fail to operate the next time it is needed. 

The following policy has been proposed for the inspection of 
diesel generators in a reactor. After a generator is found to 
be down on inspection and is repaired, it undergoes K inspections 
at short intervals of time. If it is found to be up at each of 
these short inspections, then it is inspected at long intervals 
thereafter until it is found to be down. Whenever a generator 
is found to be down and is repaired, inspections start with the 
K short inspection intervals again. This type of inspection 
policy reflects the idea that after the system is repaired it
should be inspected more often for a while to ensure it was
repaired correctly. In Section 2 we present a model for this
inspection policy and derive an expression for the expected time 
to a catastrophic event. 

In Sections 3 through 5 we will use various Markov decision
and renewal theoretic formulations of the problem to investigate 
the forms of the optimal inspection policies which maximize the 
expected time until a catastrophic event occurs. This will show 
us how certain assumptions about inspection and repair of the 
standby system affect the form of the inspection policy. 

Almost all the previous work on inspecting a single standby 
unit uses a cost criterion. Barlow and Proschan [2] described 
the basic average cost per unit time model with accurate instan- 
taneous inspection and faultless repair, while Luss and Kander 
[9] allowed for non-zero inspection times. Wattanapanom and 
Shaw [20] studied the problem when inspection is hazardous, so 
that it is possible for the inspection to cause the unit to fail. 
Nakagawa [11] looked at the probability that at an initiating 
event the standby system will work, while Butler [3] maximized 
the expected lifetime of the standby unit, but did not allow re- 
pairs. His model allowed the standby unit to be in more than one 
'up' state, which are distinguishable only upon inspection. This 
connects with the work on partially observable Markov decision 
processes [1,10,16], and in particular the problem of optimal 
inspection and repair of a deteriorating process with imperfect 
information introduced by Ross [13] and generalized by White [21],
Rosenfield [12], Luss [8], Sengupta [15], Suzuki [17], and Wong
[19]. In these papers, a system can be in more than one state,
but which one is known only imperfectly or upon inspection. 

Our models of the inspection and repair of the standby sys- 
tem allow for non-zero inspection-maintenance times and non-zero 
repair periods, but we ignore the time the unit is in use. The 
idea is that during inspection-maintenance and repair the unit 
can not react to an initiating event and so these are critical 
times for the system, whereas we make the assumption that the 
time the standby system is actually in use is so small it can be 
neglected. We also allow for imperfect repair and hazardous 
inspection, so that even if the unit is up on inspection, it 
might be down immediately after. Thus we explicitly represent 
possible mistakes in inspection, and allow for incorrectly iden- 
tifying the unit as working when in fact it was down. Another 
model considered allows the unit to be in one of two 'up' states, 
which are indistinguishable on inspection, but have different 
failure rates. This is intended to incorporate the idea that a 
repair might put right the superficial cause of the unit's failure, 
but not deal with the underlying problem, which will recur. 

In Section 3, we introduce our basic discrete time models 
where the unit can only be either 'up' or 'down'. The times 
between initiating events are assumed to have a geometric distri- 
bution. We describe the case where successfully dealt with 
initiating events are recorded as showing the unit was working 
at that time. By modelling this as a Markov decision process 
we can find the form of the optimal inspection policy to maximize 
expected time to a catastrophic event. We compare this with the
case where we ignore any information from successfully dealt
with initiating events. We also look at the expected times until 
a catastrophic event under different policies, and optimize the 
probability that the system will last at least a fixed number 
of time periods. Section 4 describes the equivalent continuous 
time model and shows how the discrete time results are replicated 
if the lifetime of the unit is exponential and the initiating 
events occur according to a Poisson process. We also investi- 
gate the optimal inspection policy for general lifetime distribu- 
tions. Section 5 generalizes the discrete time model to allow 
the unit to be in two 'up' states. In certain cases the optimal 
inspection policy for this model has quite short inspection 
periods immediately after a repair, which then lengthen as 
further inspections suggest the system is in the "better" up
state.

2. CONTINUOUS TIME MODEL WITH TWO-UP STATES AND SHORT-LONG
INSPECTION POLICY

Assume the system can be in one of two up-states, j = 1, 2,
until it fails. The two up-states are indistinguishable upon
inspection. After a repair the system goes to up-state j with
probability $\pi_j$ and remains there until it fails. After a repair
the conditional distribution of the time to failure, given it is
in up-state j, is $G_j$, independent of the past.

After a repair the system is inspected and maintained at K
short intervals of length S. If the system is found to be up
at each of the K short inspection intervals, then future
inspections occur at long intervals of length L > S. If the system is
found to be down upon inspection, it is repaired and then
inspected at K short inspection intervals again before the long
inspection intervals begin. If the system is found to be up
upon inspection, routine maintenance is performed. Given the
system is in up-state j, the conditional distribution of the
time to failure after an inspection is $F_j$, independent of the
past. Some reasonable and tractable examples of distributions
$F_j$ and $G_j$ are the exponential, and the exponential with a
probability atom at the origin reflecting hazardous inspection or
faulty repair.

Inspection-maintenance takes M units of time and repair
takes R units of time. Initiating events occur according to a
Poisson process with rate $\nu$. The system is unable to respond
to an initiating event during inspection-maintenance or repair.
A catastrophic event is said to occur if an initiating event
occurs when the system has failed or is being inspected,
maintained, or repaired. Let T denote the time of the first
catastrophic event. We will derive an expression for the
expected value of T.

Let $f(j,k) = E_{j,k}[T]$ denote the expected time to the first
catastrophic event given that k = 0, 1, ..., K short inspection periods
have already successfully taken place and the system is in
up-state j. Let $f(j,\ell) = E_{j,\ell}[T]$ denote the expected time to the first
catastrophic event given a successful inspection has just taken
place, the next inspection period is long, and the system is
in up-state j.

A probabilistic argument gives the following system of
equations, where $\bar F_j(S) = 1 - F_j(S)$ and $\bar G_j(S) = 1 - G_j(S)$:

$$
\begin{aligned}
f(j,0) ={}& \bar G_j(S)\, e^{-\nu M}\,\{S + M + f(j,1)\} \\
&+ \bar G_j(S) \int_0^M (S+u)\,\nu e^{-\nu u}\,du \\
&+ \int_0^S G_j(du) \int_0^{S-u+R} (u+z)\,\nu e^{-\nu z}\,dz \\
&+ \int_0^S G_j(du)\, e^{-\nu(S-u+R)} \Bigl[S + R + \sum_{i=1}^{2} \pi_i f(i,0)\Bigr];
\end{aligned}
\tag{2.1}
$$

for $1 \le k \le K-1$,

$$
\begin{aligned}
f(j,k) ={}& \bar F_j(S)\, e^{-\nu M}\,\{S + M + f(j,k+1)\} \\
&+ \bar F_j(S) \int_0^M (S+u)\,\nu e^{-\nu u}\,du \\
&+ \int_0^S F_j(du) \int_0^{S-u+R} (u+z)\,\nu e^{-\nu z}\,dz \\
&+ \int_0^S F_j(du)\, e^{-\nu(S-u+R)} \Bigl[S + R + \sum_{i=1}^{2} \pi_i f(i,0)\Bigr],
\end{aligned}
\tag{2.2}
$$

where $f(j,K) = f(j,\ell)$;

$$
\begin{aligned}
f(j,\ell) ={}& \bar F_j(L)\, e^{-\nu M}\,[L + M + f(j,\ell)] \\
&+ \bar F_j(L) \int_0^M (L+u)\,\nu e^{-\nu u}\,du \\
&+ \int_0^L F_j(du) \int_0^{L-u+R} (u+z)\,\nu e^{-\nu z}\,dz \\
&+ \int_0^L F_j(du)\, e^{-\nu(L-u+R)} \Bigl[L + R + \sum_{i=1}^{2} \pi_i f(i,0)\Bigr].
\end{aligned}
\tag{2.3}
$$

After some simplification, equations (2.1)-(2.3) become

$$f(j,0) = a_0(j,S) + p_0(j,S)\, f(j,1) + c_0(j,S)\, \pi f(0); \tag{2.4}$$

for $1 \le k \le K-1$,

$$f(j,k) = a(j,S) + p(j,S)\, f(j,k+1) + c(j,S)\, \pi f(0); \tag{2.5}$$

$$f(j,\ell) = a(j,L) + p(j,L)\, f(j,\ell) + c(j,L)\, \pi f(0) \tag{2.6}$$

where

$$\pi f(0) = \sum_{j=1}^{2} \pi_j f(j,0); \tag{2.7}$$

$$p_0(j,S) = \bar G_j(S)\, e^{-\nu M}; \tag{2.8}$$

$$c_0(j,S) = e^{-\nu(S+R)} \int_0^S e^{\nu u}\, G_j(du); \tag{2.9}$$

$$a_0(j,S) = \frac{1}{\nu}\,[1 - p_0(j,S) - c_0(j,S)] + \bar G_j(S)\,S + \int_0^S u\, G_j(du); \tag{2.10}$$

$$p(j,t) = \bar F_j(t)\, e^{-\nu M}; \tag{2.11}$$

$$c(j,t) = e^{-\nu(t+R)} \int_0^t e^{\nu u}\, F_j(du); \tag{2.12}$$

$$a(j,t) = \frac{1}{\nu}\,[1 - p(j,t) - c(j,t)] + \bar F_j(t)\,t + \int_0^t u\, F_j(du). \tag{2.12'}$$

In the special case in which $F_j$ has an exponential
distribution with an atom at the origin,

$$
F_j(t) =
\begin{cases}
0 & \text{if } t < 0, \\
(1-\alpha_j) + \alpha_j\,[1 - e^{-\delta_j t}] & \text{if } t \ge 0,
\end{cases}
$$

then

$$p(j,t) = \alpha_j\, e^{-\delta_j t}\, e^{-\nu M}; \tag{2.13}$$

$$c(j,t) = e^{-\nu(t+R)} \Bigl[(1-\alpha_j) + \alpha_j\, \frac{\delta_j}{\delta_j - \nu}\,\bigl(1 - e^{-(\delta_j-\nu)t}\bigr)\Bigr]; \tag{2.14}$$

$$a(j,t) = \frac{1}{\nu}\,[1 - p(j,t) - c(j,t)] + \alpha_j\, \frac{1}{\delta_j}\,[1 - e^{-\delta_j t}]. \tag{2.15}$$



Solving equations (2.4)-(2.6) recursively leads to the
following expression for the expected time to the first
catastrophic event given the system has just been repaired:

$$\pi f(0) = \frac{\mathrm{NUM}}{\mathrm{DEN}} \tag{2.16}$$

where

$$\mathrm{NUM} = \sum_{j=1}^{2} \pi_j\, [a_0(j,S) + p_0(j,S)\, g_N(j,K-1)], \tag{2.17}$$

$$g_N(j,K-1) = \frac{1 - p(j,S)^{K-1}}{1 - p(j,S)}\, a(j,S) + p(j,S)^{K-1}\, \frac{a(j,L)}{1 - p(j,L)}, \tag{2.18}$$

$$\mathrm{DEN} = \sum_{j=1}^{2} \pi_j\, [1 - \{c_0(j,S) + p_0(j,S)\, g_D(j,K-1)\}], \tag{2.19}$$

$$g_D(j,K-1) = \frac{1 - p(j,S)^{K-1}}{1 - p(j,S)}\, c(j,S) + p(j,S)^{K-1}\, \frac{c(j,L)}{1 - p(j,L)}. \tag{2.20}$$
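The closed form (2.16)-(2.20) can be checked numerically against a direct solution of the recursion (2.4)-(2.7). The following is a minimal sketch for the exponential-with-atom case (2.13)-(2.15), not the authors' program. The function names are hypothetical, and the values of M and R used in the usage note are assumptions chosen only for illustration. The sketch also assumes $\delta_j \ne \nu$.

```python
import math

def coefficients(t, ok, delta, nu, M, R):
    """(p, c, a) of (2.13)-(2.15) for the lifetime cdf (1-ok) + ok*(1-exp(-delta*t))."""
    p = ok * math.exp(-delta * t) * math.exp(-nu * M)
    # Integral of exp(nu*u) against the lifetime distribution on [0, t],
    # including the atom of mass (1 - ok) at the origin; needs delta != nu.
    integral = (1 - ok) + ok * delta / (delta - nu) * (1 - math.exp(-(delta - nu) * t))
    c = math.exp(-nu * (t + R)) * integral
    a = (1 - p - c) / nu + ok / delta * (1 - math.exp(-delta * t))
    return p, c, a

def expected_time_closed_form(S, L, K, pi, deltas, oki, okr, nu, M, R):
    """pi f(0) = NUM / DEN as in (2.16)-(2.20); requires K >= 1."""
    num = den = 0.0
    for pj, dj in zip(pi, deltas):
        p0, c0, a0 = coefficients(S, okr, dj, nu, M, R)  # first interval after a repair: G_j
        pS, cS, aS = coefficients(S, oki, dj, nu, M, R)  # later short intervals: F_j
        pL, cL, aL = coefficients(L, oki, dj, nu, M, R)  # long intervals: F_j
        geo = (1 - pS ** (K - 1)) / (1 - pS)
        g_num = geo * aS + pS ** (K - 1) * aL / (1 - pL)
        g_den = geo * cS + pS ** (K - 1) * cL / (1 - pL)
        num += pj * (a0 + p0 * g_num)
        den += pj * (1 - (c0 + p0 * g_den))
    return num / den

def expected_time_by_recursion(S, L, K, pi, deltas, oki, okr, nu, M, R):
    """Solve (2.4)-(2.7) directly, using that x -> sum_j pi_j f(j,0) is affine in x = pi f(0)."""
    def step(x):
        total = 0.0
        for pj, dj in zip(pi, deltas):
            p0, c0, a0 = coefficients(S, okr, dj, nu, M, R)
            pS, cS, aS = coefficients(S, oki, dj, nu, M, R)
            pL, cL, aL = coefficients(L, oki, dj, nu, M, R)
            f = (aL + cL * x) / (1 - pL)          # f(j, l) from (2.6)
            for _ in range(K - 1):                # f(j, K-1), ..., f(j, 1) from (2.5)
                f = aS + pS * f + cS * x
            total += pj * (a0 + p0 * f + c0 * x)  # f(j, 0) from (2.4)
        return total
    intercept = step(0.0)
    slope = step(1.0) - intercept
    return intercept / (1 - slope)
```

For instance, with the assumed values M = 0.1 and R = 0.5 weeks, `expected_time_closed_form(1, 2, 2, (0.9, 0.1), (9/258, 0.5), 0.9, 0.9, 0.1, 0.1, 0.5)` and the corresponding call to `expected_time_by_recursion` should agree to rounding error. Because M and R here are illustrative, the output is not expected to reproduce Table 1.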



EXAMPLE. The rate of initiating events is $\nu = 0.1$ per week;
$\pi_1 = 0.9 = 1 - \pi_2$. The length of an inspection-maintenance
period M is weeks. A repair period, R, is weeks. 



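$$
F_j(t) =
\begin{cases}
0 & \text{if } t < 0, \\
(1-\mathrm{OKI}) + \mathrm{OKI}\,[1 - e^{-\delta_j t}] & \text{if } t \ge 0;
\end{cases}
$$

$$
G_j(t) =
\begin{cases}
0 & \text{if } t < 0, \\
(1-\mathrm{OKR}) + \mathrm{OKR}\,[1 - e^{-\delta_j t}] & \text{if } t \ge 0.
\end{cases}
$$

Assume $\delta_1 = 9/258$ per week and $\delta_2 = 1/2$ per week. Note that after a
repair the conditional expected time to system failure given the
system is up is $\pi_1/\delta_1 + \pi_2/\delta_2 = 26$ weeks. Thus, if after a
repair no inspections are done, then the expected time to a
catastrophic event is $(\mathrm{OKR})(26) + 1/\nu = (\mathrm{OKR})(26) + 10$ weeks.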
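The arithmetic behind the no-inspection figure can be reproduced in a few lines. This is only an illustrative check. It assumes $\delta_1 = 9/258$ and $\delta_2 = 1/2$ per week, the values for which the stated mean of 26 weeks comes out exactly, and the function name is hypothetical.

```python
# Parameters of the example (per week); delta_1 = 9/258 is an assumed reading.
pi = (0.9, 0.1)
deltas = (9 / 258, 1 / 2)
nu = 0.1

def no_inspection_time(okr):
    """Expected time to a catastrophe when the unit is never inspected after a repair."""
    mean_life = sum(p / d for p, d in zip(pi, deltas))  # pi_1/delta_1 + pi_2/delta_2
    # With probability okr the repair worked; after failure the wait for the
    # next initiating event is exponential with mean 1/nu.
    return okr * mean_life + 1 / nu
```

Under these assumptions `no_inspection_time(0.9)` is 33.4 weeks and `no_inspection_time(0.5)` is 23 weeks, up to rounding.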

An exploratory numerical study was conducted of the best 
values of S, L, and K for various values of OKI, OKR. We 
restricted our attention to the case in which inter-inspection 
periods are in integer numbers of weeks. Equations (2.16)-(2.20)
were evaluated numerically for various parameter values. Some 
results are summarized in Table 1.

Table 1

                                                      Expected time
OKI    OKR    Best S   Best K   Best L   Best πf(0)   if no inspections
0.9    0.9    1        2        2        69.24        33.4
0.5    0.5    ∞        -        -        23           23
0.5    0.9    9        1        1        38.39        33.4
0.9    0.5    1        1        3 or 4   50.63        23



If the quality of the repair is better than the quality of 
inspection (OKR > OKI) then it appears to be better not to in- 
spect often initially after a repair but then to inspect more 
often as time goes on. If OKI > OKR then it appears to be 
better to inspect soon after a repair and if the system is up 
at inspection not to inspect for a longer period of time 
thereafter. If both repair and inspection are of poor quality 
then it appears to be better not to do anything. Note that the 
expected time to a catastrophe seems to be more sensitive to 
OKI than to OKR. 

In the remainder of the paper we will study optimal inspection
policies.

3. DISCRETE TIME, ONE-UP-STATE, MARKOV DECISION PROCESS MODELS

MODEL 1

In the first model, the standby unit can be either 'up' or
'down' when it is not in operation; and, if n basic time periods,
e.g., days, have elapsed since the unit was installed, $s_n$ is
the probability that it will be 'up' at the next time period
given that it is 'up' in this (the n-th) time period. Once the
unit goes 'down' it remains 'down' until either it is successfully
repaired or else a catastrophic initiating event occurs.
Each time period, the operator can inspect the unit, repair it,
or do nothing. If the inspection finds the unit is 'up', no
repairs are made, but there is a probability (1-i) that the
inspection was actually hazardous or damaging, and so the unit
is 'down' immediately after inspection. An inspection which
finds the unit up takes M periods, where M need not be an integer;
during this time the unit cannot respond to an initiating event.
If, on inspection, the unit is found in the down state, a repair
is attempted, which with probability r will return the unit to
the 'up' state and with probability (1-r) leaves it in the down
state; this takes a total time of R periods to perform ($R \ge M$);
again the unit cannot respond to an initiating event during this
time. If the operator decides on a repair without inspection,
the unit is again out of operation for R periods and has
probability r of being in the 'up' state immediately afterwards,
irrespective of whether it was up or down before the repair.

An initiating event, i.e., one that demands the standby
unit's services, occurs at random with probability $\beta$ each period,
i.e., according to a Bernoulli trials process, so the times
between events are independent and geometric. In this model we
assume the operator is aware of those initiating events to which
the standby unit responded satisfactorily. This implies the
unit was 'up' at that time, and although we neglect the time
it was in operation, we say there is a (1-c) chance that its use will
have caused it to go down by the end of the period. So if it was used
in the n-th period after the unit was installed, there is a probability
c that it will be 'up' at the next period. (If c = 1, use is
not hazardous.) If the standby system is down or is being
inspected or repaired when an initiating event occurs, a
catastrophic event occurs. The objective is to maximize the expected
number of periods until a catastrophic event occurs.

The situation described can be treated as an infinite-state
Markov decision process. The state space is
$S = \{(p,n) : 0 \le p \le 1,\ n = 1,2,\ldots\}$, where p is our belief that
the unit is 'up' this period, and n is the number of periods
since the standby unit was installed. There are three actions
open to us at each state: do nothing, inspect, or repair. Let
V(p,n) be the maximum expected number of periods until a
catastrophic event, given that this is the n-th period since
installation, and p is our belief at this time that the unit is 'up'.
Standard dynamic programming arguments [14] show that V(p,n)
satisfies the optimality equation

$$V(p,n) = \max\{W_1(p,n),\ W_2(p,n),\ W_3(p,n)\} \tag{3.1}$$

where

$$
\begin{aligned}
W_1(p,n) &= 1 + (1-\beta)\,V(s_n p,\,n+1) + \beta p\,V(c,\,n+1), \\
W_2(p,n) &= p\,\bigl[(1-(1-\beta)^M)/\beta + (1-\beta)^M V(i,\,n+M)\bigr] \\
&\quad + (1-p)\,\bigl[(1-(1-\beta)^R)/\beta + (1-\beta)^R V(r,\,n+R)\bigr], \\
W_3(p,n) &= (1-(1-\beta)^R)/\beta + (1-\beta)^R V(r,\,n+R).
\end{aligned}
$$

Note that

$$(1-(1-\beta)^M)/\beta = \beta + 2\beta(1-\beta) + 3\beta(1-\beta)^2 + \cdots + M(1-\beta)^{M-1}$$

is the expected number of periods to pass, up to a maximum of
M, until an initiating event occurs. $W_a(p,n)$, a = 1, 2, 3, represents the
payoff from an action; for example $W_1(p,n)$ corresponds to doing
nothing, where with probability $(1-\beta)$ no demand occurs, while
with probability $\beta p$ an initiating event is successfully dealt
with and with probability $(1-p)\beta$ a catastrophic event occurs.
(3.1) is an example of Denardo's contraction operator approach
to dynamic programming [4], and hence the optimal policy is
independent of the past history of the system and consists of
inspecting in state (p,n) if $W_2(p,n) > \max\{W_1(p,n), W_3(p,n)\}$,
repairing if $W_3(p,n) > \max\{W_1(p,n), W_2(p,n)\}$, and otherwise doing
nothing.
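The identity for the truncated geometric mean used in $W_2$ and $W_3$ is easy to confirm numerically. The sketch below, with hypothetical function names, compares the series with the closed form for integer M.

```python
def truncated_mean_series(beta, M):
    """beta + 2*beta*(1-beta) + ... + (M-1)*beta*(1-beta)**(M-2) + M*(1-beta)**(M-1)."""
    head = sum(k * beta * (1 - beta) ** (k - 1) for k in range(1, M))
    return head + M * (1 - beta) ** (M - 1)

def truncated_mean_closed(beta, M):
    """Closed form (1 - (1-beta)**M) / beta for the same expectation."""
    return (1 - (1 - beta) ** M) / beta
```

Both functions give the expected number of periods, capped at M, until the first initiating event. For M = 1 both return 1, since one full period always elapses.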

As there is a probability of at least $\beta(1 - \max_k\{s_k\})$ of a catastrophic
event within two periods from any state and under any policy,
we have

$$1/\beta \le V(p,n) \le 2/[\beta\,(1 - \max_k s_k)]. \tag{3.2}$$

It is easier to work with $\tilde V(p,n) = V(p,n) - 1/\beta$, which is the
expected extra time until a catastrophic event because there
is a standby unit. (3.1) then becomes

$$\tilde V(p,n) = \max\{\tilde W_1(p,n),\ \tilde W_2(p,n),\ \tilde W_3(p,n)\} \tag{3.3}$$

where

$$
\begin{aligned}
\tilde W_1(p,n) &= p + (1-\beta)\,\tilde V(s_n p,\,n+1) + \beta p\,\tilde V(c,\,n+1), \\
\tilde W_2(p,n) &= p\,(1-\beta)^M\,\tilde V(i,\,n+M) + (1-p)\,(1-\beta)^R\,\tilde V(r,\,n+R), \\
\tilde W_3(p,n) &= (1-\beta)^R\,\tilde V(r,\,n+R).
\end{aligned}
$$

Lemma 3.1.

If $s_n$ are non-increasing in n then $\tilde V(p,n)$ is convex and
non-decreasing in p, and non-increasing in n.

Proof. Apply value iteration to solve (3.3); the iterates
$\tilde V_m(p,n)$ satisfy

$$
\tilde V_{m+1}(p,n) = \max
\begin{cases}
p + (1-\beta)\,\tilde V_m(s_n p,\,n+1) + \beta p\,\tilde V_m(c,\,n+1); \\
p\,(1-\beta)^M\,\tilde V_m(i,\,n+M) + (1-p)\,(1-\beta)^R\,\tilde V_m(r,\,n+R); \\
(1-\beta)^R\,\tilde V_m(r,\,n+R).
\end{cases}
\tag{3.4}
$$

Let $\tilde V_0(p,n) = 0$ for all p and n, which is convex and
non-decreasing in p and non-increasing in n. Since sums and
maxima of convex functions are convex,
if $\tilde V_m(p,n)$ is convex in p for all n, so is $\tilde V_{m+1}(p,n)$. Thus
by induction $\tilde V_m(p,n)$ is convex in p and since, by [14],
$\tilde V_m(\cdot,\cdot)$ converges to $\tilde V(\cdot,\cdot)$, the solution of (3.3), this limit
function is also convex in p.

Again notice that if $\tilde V_m(p,n)$ is non-decreasing in p for all
n, so is $p + (1-\beta)\,\tilde V_m(s_n p,\,n+1) + \beta p\,\tilde V_m(c,\,n+1)$, since $\tilde V_m(\cdot,\cdot) \ge 0$
is non-decreasing in p, and also
$\max\{p\,(1-\beta)^M \tilde V_m(i,n+M) + (1-p)\,(1-\beta)^R \tilde V_m(r,n+R),\ (1-\beta)^R \tilde V_m(r,n+R)\}$.
Hence $\tilde V_{m+1}(p,n)$, the maximum of these
two non-decreasing functions, is non-decreasing and the induction
step goes through. In the limit as $m \to \infty$ this proves $\tilde V(p,n)$
is non-decreasing in p.

For the dependence of $\tilde V(p,n)$ on n, we again use induction
on the iterates $\tilde V_m(p,n)$: notice that (3.4) implies

$$
\tilde V_{m+1}(p,n) - \tilde V_{m+1}(p,n+1) \ge \min
\begin{cases}
(1-\beta)\bigl(\tilde V_m(s_n p,\,n+1) - \tilde V_m(s_{n+1} p,\,n+2)\bigr) + \beta p\,\bigl(\tilde V_m(c,\,n+1) - \tilde V_m(c,\,n+2)\bigr); \\
p\,(1-\beta)^M \bigl(\tilde V_m(i,\,n+M) - \tilde V_m(i,\,n+1+M)\bigr) + (1-p)\,(1-\beta)^R \bigl(\tilde V_m(r,\,n+R) - \tilde V_m(r,\,n+1+R)\bigr); \\
(1-\beta)^R \bigl(\tilde V_m(r,\,n+R) - \tilde V_m(r,\,n+1+R)\bigr).
\end{cases}
\tag{3.5}
$$

Assume $\tilde V_m(p,n) \ge \tilde V_m(p,n+1)$ for all p and n; then the fact that
$\tilde V_m(p,n)$ is non-decreasing in p, together with $s_n \ge s_{n+1}$, means that, for all p,

$$\tilde V_m(s_n p,\,n+1) - \tilde V_m(s_{n+1} p,\,n+2) \ge \tilde V_m(s_n p,\,n+1) - \tilde V_m(s_{n+1} p,\,n+1) \ge 0. \tag{3.6}$$

Hence (3.5) gives $\tilde V_{m+1}(p,n) - \tilde V_{m+1}(p,n+1) \ge 0$
and the induction hypothesis holds. Thus, the limit function
$\tilde V(p,n)$ is also non-increasing in n.

These results help to describe the optimal policy. 

Theorem 3.1.

The optimal policy is given by a set of numbers $p_n^*$,
n = 1, 2, ..., where, n periods after installing the standby
system, one does nothing in state (p,n) if $p > p_n^*$;
inspects if $p \le p_n^*$ and $(1-\beta)^M \tilde V(i,n+M) \ge (1-\beta)^R \tilde V(r,n+R)$; and
repairs if $p \le p_n^*$ and $(1-\beta)^M \tilde V(i,n+M) < (1-\beta)^R \tilde V(r,n+R)$. Notice that if
$i \ge r$, then one never repairs, as $(1-\beta)^M \tilde V(i,n+M) \ge (1-\beta)^R \tilde V(r,n+R)$
for all n.

Proof. Notice that if $(1-\beta)^M \tilde V(i,n+M) \ge (1-\beta)^R \tilde V(r,n+R)$, then $\tilde W_2(p,n) \ge \tilde W_3(p,n)$
for all p; otherwise $\tilde W_3(p,n) \ge \tilde W_2(p,n)$ for all p. Now look at
$\{p : \tilde W_1(p,n) \le \max_{a=2,3} \tilde W_a(p,n)\}$, which is the set of states (p,n)
where it is not best to do nothing. Since both $\tilde W_2(p,n)$ and
$\tilde W_3(p,n)$ are linear in p and $\tilde V(p,n)$ is convex, so that $\tilde W_1(p,n)$ is convex in p, we get for any
$p_1$ and $p_2$ in the above region and any $\lambda$, $0 \le \lambda \le 1$,

$$
\begin{aligned}
\tilde W_1(\lambda p_1 + (1-\lambda) p_2,\, n)
&\le \lambda\, \tilde W_1(p_1,n) + (1-\lambda)\, \tilde W_1(p_2,n) \\
&\le \lambda\, \tilde W_a(p_1,n) + (1-\lambda)\, \tilde W_a(p_2,n) = \tilde W_a(\lambda p_1 + (1-\lambda) p_2,\, n)
\end{aligned}
\tag{3.7}
$$

where a = 2 or 3 depending on which is the maximum. Hence (3.7)
implies $\tilde W_1(\lambda p_1 + (1-\lambda) p_2, n) \le \max_{a=2,3} \tilde W_a(\lambda p_1 + (1-\lambda) p_2, n)$, and so the region
where it is not best to do nothing is convex.

From (3.3) we have

$$\tilde V(0,n) = \max\{(1-\beta)\,\tilde V(0,n+1),\ (1-\beta)^R\,\tilde V(r,n+R)\}. \tag{3.8}$$

If it were best to do nothing at p = 0, this would imply
$\tilde V(0,n) = (1-\beta)\,\tilde V(0,n+1)$, which contradicts the fact that $\tilde V(p,n)$ is
non-increasing in n. Hence (0,n) is in the convex region where
it is not best to do nothing. Let $p_n^*$ be the maximum value of p
in this region and the result holds.

In fact the model can be rewritten so that the state space
is countable, since not all values of p can occur.
Let $S = \{(m,x,n) : m = 0,1,2,\ldots,\ x = i, r \text{ or } c,\ n = 1,2,3,\ldots\}$,
where (m,x,n) is the state when the unit is n periods since
installation and m periods since the end of the last inspection,
repair or successful response to an initiating event; x = i if
this last occurrence was an inspection that found it up; x = r
if it was a repair; and x = c if it was a successfully dealt
with initiating event. The probability that the unit is up
in this state is $p(m,x,n) = x \prod_{k=1}^{m} s_{n-k}$, and so the optimality
equation (3.3) becomes

$$
\tilde V(m,x,n) = \max
\begin{cases}
p(m,x,n) + (1-\beta)\,\tilde V(m+1,x,n+1) + \beta\, p(m,x,n)\,\tilde V(0,c,n+1); \\
p(m,x,n)\,(1-\beta)^M\,\tilde V(0,i,n+M) + (1 - p(m,x,n))\,(1-\beta)^R\,\tilde V(0,r,n+R); \\
(1-\beta)^R\,\tilde V(0,r,n+R)
\end{cases}
\tag{3.9}
$$

and the optimal policy of Theorem 3.1 can be reinterpreted.
Corollary 3.1.

If, at n periods after installation, an initiating event
is successfully dealt with, inspect or repair next in $T_c(n)$
periods unless there is another initiating event before then;
if, at n periods after installation, the unit has just been found
to be 'up' on inspection, inspect or repair next in $T_i(n)$
periods unless an initiating event occurs; if, at n periods after
installation, the unit has just finished a repair, then inspect
or repair in $T_r(n)$ periods unless a prior initiating event
occurs. If i > r one always inspects; otherwise the repair
or inspect decision depends on the number of periods since
installation.

Proof. This is just a matter of pointing out that, for x = c, i, r,

$$T_x(n) = \min\Bigl\{m : x \prod_{k=0}^{m-1} s_{n+k} \le p^*_{n+m}\Bigr\}.$$

Notice that $T_c(n)$, $T_i(n)$, $T_r(n)$ reflect the ordering of
c, i and r, so if $c \ge i \ge r$ then $T_c(n) \ge T_i(n) \ge T_r(n)$, etc.

The dependence of this policy on n follows because the
failure rate $(1-s_n)$ is age-dependent. We would expect that if
$s_n$ decreases with n, and consequently the failure rate is
increasing, then $T_c(n)$, $T_i(n)$ and $T_r(n)$ will also be non-increasing
in n. This reflects the fact that in the long run, the aging
of the unit will lead to more frequent inspections. At the
moment we are more interested in the effect of inspections and
repair before aging starts to play a part. The interesting
decision to replace an aging unit will not be analyzed at this
time. From now on, assume that the failure rate is constant,
which leads to the following simplification of Model 1.



MODEL 2

Assume $s_n = s$ for all n in Model 1, and c = i. This
corresponds to thinking of an initiating event successfully dealt
with as an inspection which takes zero time. The state space
becomes $S = \{(m,x) : m = 0,1,2,\ldots,\ x = i \text{ or } r\}$, the optimality
equation (3.9) becomes

$$
\tilde V(m,x) = \max
\begin{cases}
x s^m + (1-\beta)\,\tilde V(m+1,x) + \beta\, x s^m\,\tilde V(0,i); \\
x s^m\,(1-\beta)^M\,\tilde V(0,i) + (1 - x s^m)\,(1-\beta)^R\,\tilde V(0,r); \\
(1-\beta)^R\,\tilde V(0,r),
\end{cases}
\tag{3.10}
$$

and the optimal policy is either of the form $\pi_i(T_i,T_r)$ or
$\pi_r(T_i,T_r)$. $\pi_i(T_i,T_r)$ means inspect $T_i$ periods after a successful
response to an initiating event and $T_i$ periods after the
end of an inspection, or $T_r$ periods after the end of a repair,
unless another initiating event occurs, whereupon inspect if
$T_i$ more periods elapse without another initiating event.
$\pi_r(T_i,T_r)$ means repair $T_i$ periods after a successfully dealt
with initiating event, or $T_r$ periods after the last repair,
unless another initiating event occurs. Notice that one either
always inspects or always repairs depending on the values of
$(1-\beta)^M \tilde V(0,i)$ and $(1-\beta)^R \tilde V(0,r)$.
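The optimality equation (3.10) can be explored with a small value-iteration sketch. This is a hedged illustration, not the authors' computation. The state index m is truncated at an assumed `m_max`, where waiting is simply disallowed, so the result is an approximation from below. The function name, the stopping rule and the parameter values in the usage note are all hypothetical.

```python
def solve_model2(beta, s, i, r, M, R, m_max=40, tol=1e-10, max_sweeps=20000):
    """Value iteration for (3.10) on states (m, x), m = 0..m_max, x in {'i', 'r'}.

    Returns (V, policy): V[x][m] approximates the extra expected time V - 1/beta,
    and policy[x][m] is 'wait', 'inspect' or 'repair'.
    """
    up = {"i": i, "r": r}
    dM, dR = (1 - beta) ** M, (1 - beta) ** R
    V = {x: [0.0] * (m_max + 1) for x in up}
    policy = {x: ["inspect"] * (m_max + 1) for x in up}
    for _ in range(max_sweeps):
        new = {x: [0.0] * (m_max + 1) for x in up}
        for x, x0 in up.items():
            for m in range(m_max + 1):
                p = x0 * s ** m  # belief that the unit is up in state (m, x)
                options = {
                    "inspect": p * dM * V["i"][0] + (1 - p) * dR * V["r"][0],
                    "repair": dR * V["r"][0],
                }
                if m < m_max:  # truncation assumption: no waiting in the last state
                    options["wait"] = p + (1 - beta) * V[x][m + 1] + beta * p * V["i"][0]
                best = max(options, key=options.get)
                new[x][m], policy[x][m] = options[best], best
        change = max(abs(new[x][m] - V[x][m]) for x in up for m in range(m_max + 1))
        V = new
        if change < tol:
            break
    return V, policy
```

For example, `V, policy = solve_model2(beta=0.1, s=0.9, i=0.9, r=0.8, M=0.2, R=0.5)` returns the approximate values and actions. The first m at which `policy['i'][m]` is not `'wait'` plays the role of $T_i$, and similarly for $T_r$.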

Although the state space is infinite we can apply variants
of policy iteration and value iteration which solve the Markov
decision process to find the optimal policy and optimal expected
time to a catastrophic event. For any policy $\pi_i(T_i,T_r)$ there
are only $T_i + T_r + 2$ states the unit can be in, so for any
such policy we can calculate the corresponding expected
time. Since the problem is equivalent to one with discount
factor $(1-\beta(1-s))$, we can apply the bounds in White [22] to
find a finite state approximation, whose value is within any
prescribed amount of the optimal value. These bounds tell us
how many states (m,x) we need to consider. The results given
in Table 2 are the optimal policy and optimal expected time for
different values of $\beta$, i, r, s, M and R, together with the
expected times under other policies. The numbers we have chosen
reflect an underlying model in which inspections can be scheduled
at discrete times, say at multiples of a week. However, a
repair or inspection takes only a fraction of this time.
Although our theory was worked out for integer inspection and
repair times, we take the same formula to approximate
non-integer times. The inspection policy $\pi_i(1,0)$ means inspect
one period after the last inspection or last initiating event and
immediately after a repair, while $\pi_r(0,100+)$ means repair
immediately after any initiating event or at least 100 periods
(100+) after a repair.

Notice the optimal policy is almost insensitive to whether
$\beta = 0.05$ or 0.01, and the expected time to a catastrophic event
is affected more by increases in i than in r or even s. The
policy $\pi_i(n,0)$, to inspect immediately after a repair, is optimal
if the probability of a repair not being effective is quite
high, say 0.4. Similarly, the model suggests one should not
inspect, i.e., one should use $\pi_r(\cdot,\cdot)$, if inspection is more hazardous than
repair, i < r.

[Table: optimal policy and expected times to a catastrophic event]

MODEL 3

We might want to change our criterion from maximizing
expected time until a catastrophic event to maximizing the
probability that the system lasts at least n periods until a
catastrophic event. This might be the case if the unit is to
be completely replaced after n periods. If we apply this
criterion to Model 2, $P_n(p)$, the probability that the system lasts
at least n periods before a catastrophic event, given we believe
it is 'up' at present with probability p, satisfies the
optimality equation

$$
\begin{aligned}
P_0(p) &= 1 \quad \text{for all } p, \\
P_{n+1}(p) &= \max
\begin{cases}
(1-\beta)\,P_n(sp) + \beta p\,P_n(i); \\
p\,(1-\beta)^{\bar M}\,P_{n+1-\bar M}(i) + (1-p)\,(1-\beta)^{\bar N}\,P_{n+1-\bar N}(r); \\
(1-\beta)^{\bar N}\,P_{n+1-\bar N}(r),
\end{cases}
\end{aligned}
\tag{3.11}
$$

where $\bar M = \min(M, n+1)$ and $\bar N = \min(R, n+1)$. The optimal policy is
again of a control-limit type.

Theorem 3.2.

The optimal policy to maximize the probability of lasting
n periods is given by the sequence $p_1^*, p_2^*, \ldots, p_n^*$, where with k
periods to go, one does nothing if $p > p_k^*$, and inspects or repairs if
$p \le p_k^*$: repair if $(1-\beta)^{\bar N} P_{k-\bar N}(r) > (1-\beta)^{\bar M} P_{k-\bar M}(i)$,
with $\bar M = \min(M,k)$ and $\bar N = \min(R,k)$; inspect
otherwise.

Proof. As in Theorem 3.1, prove by induction that $P_n(p)$ is
convex and non-decreasing in p and non-increasing in n. The
convexity of $P_n(p)$ and the linearity of the second two terms
in the maximization in (3.11) then give the result.
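The finite-horizon recursion (3.11) can be sketched on the countable state space (m, x) with $p = xs^m$. The code below is an illustration under stated assumptions, not the authors' program. It assumes integer M and R of at least 1, and the function name and data layout are hypothetical. It also pads one extra state per layer, so entries with m + n greater than the horizon plus one are only approximate.

```python
def survival_probabilities(horizon, beta, s, i, r, M, R):
    """P[n][x][m]: maximum probability of lasting n more periods from state (m, x).

    Implements (3.11) with p = x * s**m; M and R are assumed to be integers >= 1.
    """
    up = {"i": i, "r": r}
    m_top = horizon  # m grows by at most one per period of waiting
    P = [{x: [1.0] * (m_top + 2) for x in up}]  # P_0 = 1 everywhere
    for n in range(horizon):
        Mb, Nb = min(M, n + 1), min(R, n + 1)
        layer = {}
        for x, x0 in up.items():
            row = []
            for m in range(m_top + 1):
                p = x0 * s ** m
                wait = (1 - beta) * P[n][x][m + 1] + beta * p * P[n]["i"][0]
                after_inspect = (1 - beta) ** Mb * P[n + 1 - Mb]["i"][0]
                after_repair = (1 - beta) ** Nb * P[n + 1 - Nb]["r"][0]
                row.append(max(wait,
                               p * after_inspect + (1 - p) * after_repair,
                               after_repair))
            row.append(row[-1])  # padding so that index m + 1 always exists
            layer[x] = row
        P.append(layer)
    return P
```

With one period to go the recursion reduces to $P_1 = 1 - \beta(1 - xs^m)$, which gives a quick sanity check. `survival_probabilities(20, 0.05, 0.95, 0.9, 0.8, 1, 2)[20]['r'][0]` would then be the best achievable chance of lasting 20 periods starting just after a repair, for these assumed parameter values.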

If the state space is changed to $S = \{(m,x) : m = 0,1,2,\ldots,\ x = i \text{ or } r\}$,
by noting $p = xs^m$ at (m,x), the obvious change occurs in
the optimal policy. In Table 3 we compare the maximum
chance of lasting n periods before a catastrophic event for
n = 10, 50 and 200 with the same chance under the policy $\pi^*$
that maximizes the expected time to a catastrophic failure.
These figures are similar to those given for Model 2 except
that the length of period is 1/10 of that there. So we can 
think of the probabilities as those of lasting 10, 50 or 100 
weeks without a catastrophic failure. The optimal policy for 
maximizing expected time until failure does very well in almost 
all cases. 

MODEL 4

Suppose any information derived from having successfully
dealt with initiating events, as in Model 2, were ignored;
what changes would occur? We can no longer model this as a
Markov decision process period by period, since in such a process we cannot
ignore information we know. However, we can construct a renewal
theory model, for each end of inspection or end of repair is
a type of renewal point. Thus we can define $V_r$ ($V_i$) as the
maximum expected time to a catastrophic event starting immediately
after a repair (an inspection). The rest of the model
is the same as Model 2, with i, r, s, M, R having the same
meaning as there. The optimality equation is then



max{L.(T.) + is ^ ( ( 1- (1-B ) ^) /3 + (1-3)^ V.) 
mil 1 



T . 



V. 

1 



+ p^(T^) ((l-(l-3)^/3) + (1-3)^ v^) } 



max / + rs ^ ( ( 1- ( 1-3 ) ^) /3 + (1-3)^ 



T 



V 



r 




+ (1-3)^)V^) 




(3.12) 



26 



where L (T) = T - J [1-ps^] [1- ( 1-3 ) ^ is the expected number 

P i=0 

of periods, up to a maximum of T until a catastrophic event 

occurs, if p is the probability the unit is up at the start 

T-1 

of the first period; and p (T) = (1-xs*^) - I [ 3 ( 1-3 ) ^ [l-xs'^~^~^ 

^ i=0 

is the probability that after T periods the unit is down but 

no catastrophic event has occurred given that initially it was 

up with probability x and down with probability 1-x. Again it 

is easier to work with V = V - 1/3 and the arguments of Markov 

renewal programming [7] , show that the optimal policy is either 

TTi(Ti,Tr), i.e., inspect after last inspection and after 

last repair, or 7r^(W^), i.e., repair after last repair. 

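Using (3.12) we can calculate $\tilde V_r$ under these policies. For
$\pi_i(T_i,T_r)$

$$
\tilde V_r = \frac{r(1-s^{T_r})\big(1-(1-\beta)^M i s^{T_i}\big) + i(1-s^{T_i})(1-\beta)^M r s^{T_r}}
{(1-s)\Big[\big(1-(1-\beta)^M i s^{T_i}\big)\big(1-(1-\beta)^R p_r(T_r)\big) - (1-\beta)^{M+R} p_i(T_i)\, r s^{T_r}\Big]}
\tag{3.13}
$$

while under $\pi_r(W_r)$

$$
\tilde V_r = \frac{r(1-s^{W_r})}{(1-s)\big(1-(1-\beta)^R(1-P_r(W_r))\big)}
\tag{3.14}
$$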
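The closed forms (3.13) and (3.14) are easy to evaluate numerically. The following is a minimal sketch, not the authors' program: the function names, the grid-search horizon and the parameter values in the usage note are hypothetical, $M$ and $R$ are measured in periods, and it assumes that $1 - P_r(W)$ in (3.14) is the probability of no catastrophic event within $W$ periods of a repair, i.e. $r s^{W} + p_r(W)$.

```python
def p_down_no_cat(x, s, beta, T):
    """p_x(T): unit is down after T periods and no catastrophe has occurred."""
    catastrophe = sum(beta * (1 - beta) ** k * (1 - x * s ** (T - k - 1))
                      for k in range(T))
    return (1 - x * s ** T) - catastrophe


def v_r_inspect(i, r, s, beta, M, R, Ti, Tr):
    """Improvement after a repair under pi_i(Ti, Tr), as in (3.13)."""
    bM, bR = (1 - beta) ** M, (1 - beta) ** R
    keep_i = 1 - bM * i * s ** Ti
    num = r * (1 - s ** Tr) * keep_i + i * (1 - s ** Ti) * bM * r * s ** Tr
    den = (1 - s) * (keep_i * (1 - bR * p_down_no_cat(r, s, beta, Tr))
                     - bM * bR * p_down_no_cat(i, s, beta, Ti) * r * s ** Tr)
    return num / den


def v_r_repair(r, s, beta, R, W):
    """Improvement after a repair under pi_r(W), as in (3.14)."""
    # Assumption: 1 - P_r(W) is the probability of no catastrophe in W periods.
    no_catastrophe = r * s ** W + p_down_no_cat(r, s, beta, W)
    return r * (1 - s ** W) / ((1 - s) * (1 - (1 - beta) ** R * no_catastrophe))


def best_policy(i, r, s, beta, M, R, horizon=40):
    """Grid search over both policy families; returns (value, description)."""
    candidates = [(v_r_inspect(i, r, s, beta, M, R, Ti, Tr), ("inspect", Ti, Tr))
                  for Ti in range(1, horizon + 1) for Tr in range(1, horizon + 1)]
    candidates += [(v_r_repair(r, s, beta, R, W), ("repair", W))
                   for W in range(1, horizon + 1)]
    return max(candidates, key=lambda pair: pair[0])
```

For instance, `best_policy(0.9, 0.8, 0.95, 0.1, 1, 2)` compares every pair $(T_i, T_r)$ and every $W_r$ up to the horizon by the improvement obtained starting from a repair. Truncating the search at a finite horizon is a simplification of this sketch.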



We calculate the optimal policy for the examples we did in 
Model 2, and so it is useful to compare the results with those 
given there. The results can be found in Table 4. 
TABLE 4
There are no great changes in the maximum time until a
catastrophic event. Notice that there are examples where Model
4 has a longer expected time. This may seem strange at first,
since in Model 4 we are ignoring information (the occurrence
of a successfully dealt with initiating event) which we use in
Model 2. However, to counterbalance this, in Model 4 it is
implicit that after a successfully dealt with initiating event
the standby system is bound to be up, while in Model 2 it is
only up with probability $i$. This also explains the difference
in policy for the fourth example. Since repair and inspection
are so bad, we do nothing to interfere with the unit under Model 4,
but in Model 2, because after each successfully dealt with
initiating event there is only a .5 chance it is up, we must
keep inspecting it to see if this has occurred. Otherwise the
only difference in policies is that the inspection intervals are
slightly longer in Model 4 than in Model 2.
4. CONTINUOUS TIME MODEL WITH ONE UP STATE



In this section we look at the continuous time analogue of
the standby unit model described in Section 3. Again, the
standby unit can be either 'up' or 'down', and remains down
either until it is inspected and repaired, or until a catastrophic
initiating event occurs. An inspection takes a time of $M$, and
if the unit works on inspection, nothing is done, and the lifetime
of the unit thereafter is given by the distribution function
$F_i(\cdot)$. The repair of a unit found to be 'down' on inspection
takes, together with the inspection, a time of $R$, and the
lifetime distribution function thereafter is $F_r(\cdot)$. (The discrete
time models have distribution functions corresponding to a point
mass at zero together with a geometric distribution.) The times
of the initiating events are given by a Poisson process with
parameter $\nu$ (so the average inter-initiating event time is $\nu^{-1}$).

Again, we think of an initiating event that finds the unit up
as the equivalent of an inspection. The problem is to find the
times between inspections and between a repair and the next
inspection which maximize the expected time until a catastrophic
event.

From the work of Doshi [5] on continuous time Markov decision
processes, it follows that the optimal policy has a deterministic
time $T_i$ between inspections and a deterministic time $T_r$ between
a repair and the next inspection. Moreover, if $V_i$ ($V_r$) are the
maximum expected time to a catastrophic event starting after
an inspection (repair), [5] implies $V_i$ and $V_r$ satisfy the
optimality equation:
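$$
\begin{aligned}
V_x = \sup_{0 \le T_x \le \infty}\Big\{ & \int_0^{T_x} \nu e^{-\nu t}\big(t + \bar F_x(t) V_i\big)\,dt + T_x e^{-\nu T_x} \\
& + e^{-\nu T_x}\bar F_x(T_x)\Big(\int_0^{M} t\nu e^{-\nu t}\,dt + M e^{-\nu M} + e^{-\nu M} V_i\Big) \\
& + e^{-\nu T_x} F_x(T_x)\Big(\int_0^{R} t\nu e^{-\nu t}\,dt + R e^{-\nu R} + e^{-\nu R} V_r\Big)\Big\}
\end{aligned}
\tag{4.1}
$$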



where $\bar F(t) = 1 - F(t)$ and $x = i$ or $r$. The $T_i$ and $T_r$ that
actually maximize the R.H.S. of (4.1) are the optimal inspection
times. Again, it is simpler to work with $\tilde V_x = V_x - 1/\nu$,
which is the improvement in expected time until a catastrophic
event when there is a standby system, over when there is no
standby system. If $\tilde V_i(T_i,T_r)$, $\tilde V_r(T_i,T_r)$ are these improvements
starting from an inspection and from a repair, when the inter-inspection
time is $T_i$ and $T_r$ is the time from repair to an
inspection, we get by rearranging (4.1) that



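$$
\tilde V_x(T_i,T_r) = \int_0^{T_x} e^{-\nu t}\bar F_x(t)\,dt
+ \tilde V_i(T_i,T_r)\Big[e^{-\nu (M+T_x)}\bar F_x(T_x) + \int_0^{T_x}\nu e^{-\nu t}\bar F_x(t)\,dt\Big]
+ \tilde V_r(T_i,T_r)\, e^{-\nu T_x} e^{-\nu R} F_x(T_x)
\tag{4.2}
$$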



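Solving the system of equations (4.2) we get

$$
\tilde V_i(T_i,T_r) = A(T_i,T_r)/C(T_i,T_r), \qquad \tilde V_r(T_i,T_r) = B(T_i,T_r)/C(T_i,T_r)
\tag{4.3}
$$

where

$$
A(T_i,T_r) = \big(1-e^{-\nu(R+T_r)}F_r(T_r)\big)\int_0^{T_i} e^{-\nu t}\bar F_i(t)\,dt
+ e^{-\nu(R+T_i)}F_i(T_i)\int_0^{T_r} e^{-\nu t}\bar F_r(t)\,dt
\tag{4.4}
$$

$$
B(T_i,T_r) = \big(1-e^{-\nu(M+T_i)}\bar F_i(T_i)\big)\int_0^{T_r} e^{-\nu t}\bar F_r(t)\,dt
+ e^{-\nu(M+T_r)}\bar F_r(T_r)\int_0^{T_i} e^{-\nu t}\bar F_i(t)\,dt
\tag{4.5}
$$

$$
\begin{aligned}
C(T_i,T_r) = {} & 1 - e^{-\nu(M+T_i)}\bar F_i(T_i) - e^{-\nu(R+T_r)}F_r(T_r)
+ e^{-\nu(M+R+T_i+T_r)}\big[F_r(T_r) - F_i(T_i)\big] \\
& - \big(1-e^{-\nu(R+T_r)}F_r(T_r)\big)\int_0^{T_i}\nu e^{-\nu t}\bar F_i(t)\,dt
- e^{-\nu(T_i+R)}F_i(T_i)\int_0^{T_r}\nu e^{-\nu t}\bar F_r(t)\,dt .
\end{aligned}
\tag{4.6}
$$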



If there are optimal finite inspection intervals $T_i$, $T_r$, they
must satisfy, for $x = i$ and $r$,

$$
A'_x(T_i,T_r)/A(T_i,T_r) = B'_x(T_i,T_r)/B(T_i,T_r) = C'_x(T_i,T_r)/C(T_i,T_r)
\tag{4.7}
$$

where $A'_i = \partial A/\partial T_i$ and $A'_r = \partial A/\partial T_r$, etc.
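As a numerical illustration of (4.3) to (4.6), the sketch below evaluates $A$, $B$ and $C$ for lifetimes of the form $\bar F_x(t) = w_x e^{-\lambda_x t}$, for which the integrals have closed forms. This lifetime family, the function names and the parameter values are assumptions made for the example only.

```python
import math


def pieces(T, w, lam, nu):
    """Return Fbar(T) and the integral of exp(-nu t) Fbar(t) over [0, T]."""
    fbar = w * math.exp(-lam * T)
    integral = w * (1 - math.exp(-(nu + lam) * T)) / (nu + lam)
    return fbar, integral


def improvements(Ti, Tr, nu, M, R, wi, lam_i, wr, lam_r):
    """Return (V~_i, V~_r) = (A/C, B/C) from (4.3)-(4.6)."""
    fbar_i, a_i = pieces(Ti, wi, lam_i, nu)
    fbar_r, a_r = pieces(Tr, wr, lam_r, nu)
    h_i = math.exp(-nu * (R + Ti)) * (1 - fbar_i)   # e^{-nu(R+Ti)} F_i(Ti)
    h_r = math.exp(-nu * (R + Tr)) * (1 - fbar_r)   # e^{-nu(R+Tr)} F_r(Tr)
    A = (1 - h_r) * a_i + h_i * a_r
    B = ((1 - math.exp(-nu * (M + Ti)) * fbar_i) * a_r
         + math.exp(-nu * (M + Tr)) * fbar_r * a_i)
    C = (1 - math.exp(-nu * (M + Ti)) * fbar_i - h_r
         + math.exp(-nu * (M + R + Ti + Tr)) * ((1 - fbar_r) - (1 - fbar_i))
         - (1 - h_r) * nu * a_i - h_i * nu * a_r)
    return A / C, B / C
```

A grid search of `improvements` over $(T_i, T_r)$ is one simple way to locate the intervals that (4.7) characterizes. A root finder applied to (4.7) would be an alternative.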
In the special case where the extra time for a repair is
zero and the lifetime of the unit is the same whether an inspection
or a repair has just taken place, we can show that
the optimal inspection times are finite. In this case
$\tilde V_i = \tilde V_r = \tilde V$, $M = R$, $F_i(\cdot) = F_r(\cdot) = F(\cdot)$ and $T_i = T_r = T$,
so (4.3) becomes

$$\tilde V(T) = A(T)/C(T) \tag{4.8}$$

where

$$A(T) = \int_0^{T} e^{-\nu t}\bar F(t)\,dt \tag{4.9}$$

$$C(T) = 1 - e^{-\nu(M+T)} - \int_0^{T}\nu e^{-\nu t}\bar F(t)\,dt \tag{4.10}$$

Lemma 4.1.

The optimal inspection time $T^*$ is finite and
$\tilde V(T^*) = \bar F(T^*)/\nu\big(e^{-\nu M} - \bar F(T^*)\big)$.

Proof

At a local maximum or minimum $\tilde V'(T) = 0$, which implies
$h(T) = A'(T)C(T) - C'(T)A(T) = 0$ since $C(T)^2 > 0$, where

$$
h(T) = e^{-\nu T}\Big[\bar F(T)\big(1-e^{-\nu(M+T)}\big) - \nu e^{-\nu M}\int_0^{T} e^{-\nu t}\bar F(t)\,dt\Big];
\tag{4.11}
$$

$h(0)$ is positive, and though $h(\infty) = 0$, notice that $h(T) = e^{-\nu T}g(T)$
and as $T \to \infty$, $g(T) < 0$. This shows that $T = \infty$ is a minimum
turning point and that there is a finite turning point which
is a maximum.
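Lemma 4.1 can be checked numerically. The sketch below assumes an exponential lifetime, $\bar F(t) = e^{-\lambda t}$, so that (4.9) has a closed form. It finds the root of the bracket in (4.11) by bisection and compares $A(T^*)/C(T^*)$ with the expression in the lemma. The function names and parameter values are illustrative assumptions.

```python
import math


def A(T, nu, lam):
    """(4.9) for Fbar(t) = exp(-lam t)."""
    return (1 - math.exp(-(nu + lam) * T)) / (nu + lam)


def C(T, nu, lam, M):
    """(4.10) for the same lifetime."""
    return 1 - math.exp(-nu * (M + T)) - nu * A(T, nu, lam)


def v_tilde(T, nu, lam, M):
    return A(T, nu, lam) / C(T, nu, lam, M)


def g(T, nu, lam, M):
    """Bracket of (4.11), i.e. h(T) without the positive factor exp(-nu T)."""
    fbar = math.exp(-lam * T)
    return fbar * (1 - math.exp(-nu * (M + T))) - nu * math.exp(-nu * M) * A(T, nu, lam)


def optimal_T(nu, lam, M, lo=1e-9, hi=1e3, iterations=200):
    """Bisection on g, which is positive near 0 and negative for large T."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid, nu, lam, M) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With $\nu = 0.1$, $\lambda = 0.04$ and $M = 0.035$ (time in weeks), `optimal_T` returns an interval between one and one and a half weeks.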

We could repeat the whole analysis for the continuous time
analogue of the model where we ignore successfully-dealt-with
initiating events, or at least do not consider them inspections.
Using the notation of Model 1, the optimal values $V_i$
and $V_r$ satisfy

$$
\begin{aligned}
V_x = \sup_{0 \le T_x \le \infty}\Big\{ & \int_0^{T_x} t\,dt \int_0^{t} f_x(u)\,\nu e^{-\nu(t-u)}\,du
+ \bar F_x(T_x)\Big[T_x + \int_0^{M} t\nu e^{-\nu t}\,dt + M e^{-\nu M} + e^{-\nu M}V_i\Big] \\
& + \Big(\int_0^{T_x} f_x(t)\,e^{-\nu(T_x-t)}\,dt\Big)\Big[T_x + \int_0^{R} t\nu e^{-\nu t}\,dt + R e^{-\nu R} + e^{-\nu R}V_r\Big]\Big\}.
\end{aligned}
\tag{4.12}
$$

The same analysis that led to (4.7) can be applied to (4.12) to
find the optimal $T_i$ and $T_r$. There is a difference in the
special case when $M = R$, $F_i(\cdot) = F_r(\cdot) = F(\cdot)$, $T_i = T_r = T$
and $\tilde V_i = \tilde V_r = \tilde V$, where $\tilde V = V - 1/\nu$.
Now

$$\tilde V(T) = D(T)/K(T) \tag{4.13}$$

where

$$D(T) = \int_0^{T} \bar F(u)\,du \tag{4.14}$$

and

$$K(T) = 1 - e^{-\nu(M+T)}\Big(1 + \nu\int_0^{T}\bar F(u)\,e^{\nu u}\,du\Big) \tag{4.15}$$



Lemma 4.2.

In this special case of Model 2, a sufficient condition for
$S^*$, the optimal inspection interval, to be finite is that

$$r(\infty) > \frac{\nu}{1 + \nu\mu e^{-\nu M}} \tag{4.16}$$

where $r(\infty) = \lim_{s\to\infty} f(s)/\bar F(s)$ and
$\mu = \int_0^{\infty} t f(t)\,dt < \infty$ is the expected lifetime.

Proof.

At a local maximum or minimum of $\tilde V(T)$, $\tilde V'(T) = h(T)/K(T)^2 = 0$,
where

$$
\begin{aligned}
h(T) = {} & \bar F(T)\Big(1 - e^{-\nu(M+T)} - e^{-\nu(M+T)}\,\nu\int_0^{T}\bar F(u)\,e^{\nu u}\,du\Big) \\
& - \Big(\int_0^{T}\bar F(u)\,du\Big)\,\nu e^{-\nu(M+T)}\Big(\int_0^{T} f(u)\,e^{\nu u}\,du + F(0)\Big).
\end{aligned}
\tag{4.17}
$$

Since $K(T)^2 > 0$, the condition $\tilde V'(T) = 0$ reduces to $h(T) = 0$.
Notice that $h(0) = \bar F(0)[1-e^{-\nu M}] > 0$ but $h(\infty) = 0$. Thus to
ensure the maximum is not at $T = \infty$, we must show $h'(T)$ is positive
as $T$ tends to infinity. Differentiating $h$ with respect
to $T$, it follows that as $T$ tends to infinity

$$
h'(T)/\bar F(T) \to -r(\infty)\big(1 + \mu\nu e^{-\nu M}\big) + \nu^2\mu e^{-\nu M}(\nu b - 1) + \frac{\mu\nu^2 e^{-\nu M}}{a}
\tag{4.18}
$$

where

$$
r(\infty) = \lim_{T\to\infty} f(T)/\bar F(T) = \lim_{T\to\infty} r(T), \qquad a = \lim_{T\to\infty} e^{\nu T}\bar F(T),
$$

and

$$
b = \lim_{T\to\infty} \Big(\int_0^{T}\bar F(y)\,e^{\nu y}\,dy\Big)\Big/\bar F(T)\,e^{\nu T}. \tag{4.19}
$$

If $\bar F(T)e^{\nu T} = \exp\big(-\int_0^{T}(r(t)-\nu)\,dt\big) \to c$ as $T \to \infty$, then $b = \infty$
and $h'(\infty)$ is positive; this certainly occurs if $r(\infty) > \nu$. If
$\bar F(T)e^{\nu T} \to \infty$ as $T \to \infty$, then L'Hopital's Rule says

$$
b = \lim_{T\to\infty}\frac{\bar F(T)\,e^{\nu T}}{\nu\bar F(T)\,e^{\nu T} - f(T)\,e^{\nu T}} = \frac{1}{\nu - r(\infty)}. \tag{4.20}
$$

Thus

$$
h'(T)/\bar F(T) \to -r(\infty)\big(1 + e^{-\nu M}\nu\mu\big) + \nu^2\mu e^{-\nu M}\,\frac{r(\infty)}{\nu - r(\infty)}. \tag{4.21}
$$

Since we are assuming $\bar F(T)e^{\nu T} \to \infty$ we have $r(\infty) \le \nu$. If
$r(\infty) < \nu$ then on checking when (4.21) is positive we get (4.16).
Finally if $r(\infty) = \nu$, then $b = \infty$ and $h'(T)$ is still positive at $T = \infty$.

As an example suppose $\bar F(t) = we^{-\lambda t}$, $t > 0$, so the unit has
exponential lifetime with a probability $1-w$ of instantaneous
failure; then the optimal inspection time $T$ satisfies

$$e^{-(\nu+\lambda)T}\,\big[\,\cdots\,\big] = 0 \tag{4.22}$$

(the bracketed terms are not legible in the source), and the condition
(4.16) that guarantees a finite solution to this equation is
$\lambda > \nu(1 - we^{-\nu M})$.
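The role of the condition $\lambda > \nu(1 - we^{-\nu M})$ can be illustrated numerically. The sketch below evaluates $\tilde V(T) = D(T)/K(T)$ from (4.13) to (4.15) for $\bar F(t) = we^{-\lambda t}$ in closed form (valid for $\lambda \ne \nu$) and locates the best interval on a grid. The grid, the function names and the parameter values are assumptions of this example, and a grid maximum at the right-hand end is only read as a sign that no finite optimum exists.

```python
import math


def v_tilde(T, nu, lam, w, M):
    """D(T)/K(T) from (4.13)-(4.15) with Fbar(t) = w exp(-lam t), lam != nu."""
    D = w * (1 - math.exp(-lam * T)) / lam
    # exp(-nu T) * (1 + nu * integral of Fbar(u) exp(nu u) over [0, T])
    no_catastrophe = (math.exp(-nu * T)
                      + nu * w * (math.exp(-lam * T) - math.exp(-nu * T)) / (nu - lam))
    K = 1 - math.exp(-nu * M) * no_catastrophe
    return D / K


def finite_condition(nu, lam, w, M):
    """Condition (4.16) specialised to this lifetime."""
    return lam > nu * (1 - w * math.exp(-nu * M))


def best_on_grid(nu, lam, w, M, t_max=400.0, step=0.5):
    """Grid point in (0, t_max] with the largest improvement."""
    grid = [step * k for k in range(1, int(t_max / step) + 1)]
    return max(grid, key=lambda T: v_tilde(T, nu, lam, w, M))
```

With $\nu = 0.1$, $w = 0.9$, $M = 0.035$, the choice $\lambda = 0.04$ satisfies the condition and the grid maximum is interior, whereas $\lambda = 0.005$ violates it and the improvement keeps increasing up to the end of the grid.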
5. TWO-UPSTATE MODEL

Model 7

We extend Model 2 of Section 3 to allow the unit to be in
either one of two different up states: 1-up and 2-up, which
have different failure rates. Let $s_j$, $j = 1,2$, be the probability
of remaining in state $j$ next period given that it is in
state $j$ this period, so that $1-s_j$ is the probability it will fail
in the next period. This model is intended to describe the
situation in which a repair might only correct minor faults
that caused the failure and not the underlying problem, which
caused and will continue to cause these faults. We take as our
state space $S = \{(p,g) \mid 0 \le p \le 1,\ 0 \le g \le \infty\}$, where $p$ is the
belief that the unit is up, and $g$ is the ratio of the probability
the unit is in the 1-up state to the probability it is in the
2-up state. Thus in the state $(p,g)$ the beliefs that the unit is
down, in the 1-up state and in the 2-up state are respectively
$1-p$, $gp/(g+1)$, $p/(g+1)$.

We assume that after a repair the unit is in state $(r,w)$ and
define $a = s_1/s_2$ where, without loss of generality, we assume
$s_1 \ge s_2$. The occurrence of a successfully-dealt-with initiating
event is treated as an inspection which takes no time. Let
$V(p,g)$ be the maximum extra number of periods under the best
inspection policy until a catastrophic event, compared with having
no standby unit (i.e., same definition as in Section 2).

Again, Denardo's results [4] guarantee the optimal policy to
be a deterministic one; it satisfies the optimality equation
$$V(p,g) = \max\{W_1(p,g),\ W_2(p,g),\ W_3(p,g)\} \tag{5.1}$$

$$
\begin{aligned}
W_1(p,g) &= p + (1-\beta)\,V\big(s_2 p(ag+1)/(g+1),\ ag\big) + \beta p\,V(i,g) \\
W_2(p,g) &= p(1-\beta)^M V(i,g) + (1-p)(1-\beta)^R V(r,w) \\
W_3(p,g) &= (1-\beta)^R V(r,w).
\end{aligned}
$$

The assumption is that an inspection affects the probability
the unit is up, but not the ratio between the two up states,
whereas a repair always returns the unit to the state $(r,w)$.
$\big(s_2 p(ag+1)/(g+1),\ ag\big)$ is the Bayesian updated belief of the state
$(p,g)$, using the fact that no initiating event occurred. The
optimal policy for this model is given as follows.

Theorem 5.1.

The optimal policy is given by a function $p^*(g)$ and a
number $g^*$ so that in state $(p,g)$ it does nothing if $p > p^*(g)$,
inspects if $p \le p^*(g)$, $g > g^*$, and repairs if $p \le p^*(g)$, $g \le g^*$.
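The belief update that appears in $W_1$ of (5.1) can be written out directly. The following sketch uses hypothetical function names. It converts a state $(p,g)$ into the three beliefs (down, 1-up, 2-up) and applies one idle period with no initiating event.

```python
def beliefs(p, g):
    """Beliefs (down, 1-up, 2-up) in state (p, g)."""
    return 1 - p, g * p / (g + 1), p / (g + 1)


def idle_update(p, g, s1, s2):
    """State after one period with no inspection, repair or initiating event."""
    a = s1 / s2
    return s2 * p * (a * g + 1) / (g + 1), a * g
```

Applying `idle_update` repeatedly, starting from $(i, g)$ after an inspection or from $(r, w)$ after a repair, gives the belief that the unit is still up after any number of idle periods. Since $a \ge 1$, the ratio $g$ never decreases while the unit is left alone.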

Proof

As in Theorem 3.1 an inductive proof on the iterates of
value iteration proves that $V(p,g)$ is convex and non-decreasing
in $p$ and non-decreasing in $g$. Now define
$W_g = \{p \mid V(p,g) > W_1(p,g)\}$; then the linearity of $W_2$ and $W_3$ and
the convexity of $V$ in $p$ guarantee $W_g$ is convex, just as in
Theorem 3.1. $V(0,g) = V(0,g')$ since if $p = 0$ there is only
one state. From (5.1) it follows that

$$V(0,g) = \max\{(1-\beta)V(0,ag),\ (1-\beta)^R V(r,w)\} \tag{5.2}$$

By definition $V(r,w) \ge 0$, and if $V(0,g) = W_1(0,g) = (1-\beta)V(0,ag) =
(1-\beta)V(0,g)$ then $V(0,g) = 0$, and hence $0 \in W_g$. Thus $W_g = [0,p^*(g)]$
and the result holds; $g^*$ satisfies $(1-\beta)^M V(i,g^*) = (1-\beta)^R V(r,w)$,
and since $V(i,g)$ is non-decreasing in $g$ this gives the
division between inspection and repair.

Again we can rewrite the state space in terms of the number
of periods since the last inspection and the last repair. Let
$S = \{(m,n) \mid 0 \le m \le n \le \infty\}$ where $(m,n)$ is the state which is $m$
periods since the end of the last inspection, or the end of
repair if it followed from the last inspection, and $n$ non-inspection
periods since the last repair. The state $(m,n)$ is
equivalent to $g = a^n w$ and $p = p(m,n)$, where

$$p(m,n) = [\text{expression not legible in the source}] \tag{5.3}$$

If we define $p(m,n)$ according to (5.3), the optimality equation
for this state space is

$$
V(m,n) = \max\left\{
\begin{array}{l}
p(m,n) + (1-\beta)V(m+1,n+1) + \beta p(m,n)V(0,n) \\
p(m,n)(1-\beta)^M V(0,n) + (1-p(m,n))(1-\beta)^R V(0,0) \\
(1-\beta)^R V(0,0)
\end{array}
\right.
\tag{5.4}
$$



Theorem 5.1 can be reinterpreted for this state space.

Corollary 5.1.

The optimal policy is given by a function $m^*(n)$ and a
number $n^*$ so that at $(m,n)$: do nothing if $m < m^*(n)$; inspect
if $m \ge m^*(n)$, $n > n^*$; repair if $m \ge m^*(n)$, $n \le n^*$. Notice if
$i \ge r$, $n^* = 0$ and we always inspect.

Again we can use value iteration on a finite state approximation
of the Markov decision model given by (5.4) (see White
[22] for the bounds). This gives us the results found in Table 5,
namely the optimal periods for inspections, counting from the
last repair.

Note that the optimal inspection pattern appears to have
short inter-inspection times just after a repair, which gradually
increase to long inspection times, provided the system continues
to be found up upon inspection. Hazardous inspection ($i$ small)
has a more drastic effect on the expected time to a catastrophic
failure than similar changes in $r$, or $s_1$ and $s_2$.

Model 8

As in Section 3, we could also model the situation in which
the information acquired from successfully-dealt-with initiating
events is ignored. Then $\beta$, $i$, $s_1$, $s_2$, $M$, $R$ are still defined
as in Model 7, but immediately after an inspection or repair
the time to the next inspection or repair is determined, and
which kind it will be. Immediately after a repair suppose the
unit has probability $r_1$, $r_2$ respectively of being in the 1-up
or 2-up state. The decision points are immediately after a
repair, and immediately after an inspection, where it is important
to know the number of operating periods $n$ since the last repair.
We denote the maximum expected times until a catastrophic event
at these decision points as $V_r$, $V_{i,n}$ respectively. As in Model 4
we can write down the optimality equation connecting these values:



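$$
\begin{aligned}
V_r = \max_{T_r,\,W_r}\Big\{ & L(\mathbf{r},W_r) + \big(1-f(\mathbf{r},W_r)\big)\Big(\frac{1-(1-\beta)^R}{\beta} + (1-\beta)^R V_r\Big); \\
& L(\mathbf{r},T_r) + \big(r_1 s_1^{T_r} + r_2 s_2^{T_r}\big)\Big(\frac{1-(1-\beta)^M}{\beta} + (1-\beta)^M V_{i,T_r}\Big) \\
& \quad + \big(1 - r_1 s_1^{T_r} - r_2 s_2^{T_r} - f(\mathbf{r},T_r)\big)\Big(\frac{1-(1-\beta)^R}{\beta} + (1-\beta)^R V_r\Big)\Big\} \\
V_{i,n} = \max_{T_{i,n}}\Big\{ & L(\mathbf{i}(n),T_{i,n}) + \big(i(n)_1 s_1^{T_{i,n}} + i(n)_2 s_2^{T_{i,n}}\big)\Big(\frac{1-(1-\beta)^M}{\beta} + (1-\beta)^M V_{i,\,n+T_{i,n}}\Big) \\
& \quad + \big(1 - i(n)_1 s_1^{T_{i,n}} - i(n)_2 s_2^{T_{i,n}} - f(\mathbf{i}(n),T_{i,n})\big)\Big(\frac{1-(1-\beta)^R}{\beta} + (1-\beta)^R V_r\Big)\Big\}
\end{aligned}
\tag{5.5}
$$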



where $\mathbf{r} = (r_1, r_2, 1-r_1-r_2)$. If $\mathbf{p} = (p_1,p_2,p_3)$, where $p_1$ is the
probability of being in the 1-up state, $p_2$ is the probability of
being in the 2-up state, and $p_3$ is the probability of being
down, then

$$
L(\mathbf{p},T) = T - p_3\sum_{k=1}^{T-1}\big(1-(1-\beta)^{k}\big)
- p_1\sum_{k=1}^{T-2}\big(1-(1-\beta)^{T-k-1}\big)\big(1-s_1^{k}\big)
- p_2\sum_{k=1}^{T-2}\big(1-(1-\beta)^{T-k-1}\big)\big(1-s_2^{k}\big)
\tag{5.6}
$$

is the expected time until a catastrophic failure in the first $T$
periods starting in state $\mathbf{p}$.

$$
\mathbf{i}(n) = \Big(\frac{i\,r_1 s_1^{n}}{r_1 s_1^{n} + r_2 s_2^{n}},\ \frac{i\,r_2 s_2^{n}}{r_1 s_1^{n} + r_2 s_2^{n}},\ 1-i\Big)
$$

is the state of the system after an inspection $n$ operating periods
after the last repair, while

$$
f(\mathbf{p},T) = (1-\beta)f(\mathbf{p},T-1) + p_1\beta\big(1-s_1^{T-1}\big) + p_2\beta\big(1-s_2^{T-1}\big) + \beta p_3
\tag{5.7}
$$

is the probability there has been a catastrophic failure within
$T$ periods, starting in state $\mathbf{p}$. Again, the general results of
Markov renewal programming [7] show that the only possible
optimal policies are $\pi_r(W)$, i.e., repair every $W$, or $\pi_i(T_0, T_1, \ldots)$,
which is inspect $T_0$ periods after a repair, and $T_k$ periods
after the $k$-th inspection after a repair. In order to find the
optimal policy it is easier to work with $\tilde V = V - 1/\beta$ again,
and using (5.5) we can show that under the policy $\pi_r(W)$, if
$\mathbf{r} = (r_1,r_2,1-r_1-r_2)$,

$$
\tilde V_r = \Big(\frac{r_1(1-s_1^{W})}{1-s_1} + \frac{r_2(1-s_2^{W})}{1-s_2}\Big)\Big/\Big(1 - \big(1-f(\mathbf{r},W)\big)(1-\beta)^R\Big). \tag{5.8}
$$
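The quantities in (5.6) to (5.8) are straightforward to compute. The sketch below is a minimal illustration with hypothetical function names and parameter values, with $R$ measured in periods. It evaluates $f$ by the recursion (5.7) starting from $f(\mathbf{p},0) = 0$, evaluates $L$ from (5.6), and evaluates the improvement under the repair-only policy from (5.8).

```python
def cat_prob(p, s1, s2, beta, T):
    """f(p, T) from the recursion (5.7), with f(p, 0) = 0; p = (p1, p2, p3)."""
    p1, p2, p3 = p
    f = 0.0
    for t in range(1, T + 1):
        f = (1 - beta) * f + beta * (p1 * (1 - s1 ** (t - 1))
                                     + p2 * (1 - s2 ** (t - 1)) + p3)
    return f


def expected_periods(p, s1, s2, beta, T):
    """L(p, T) from (5.6)."""
    p1, p2, p3 = p
    total = T - p3 * sum(1 - (1 - beta) ** k for k in range(1, T))
    for pj, sj in ((p1, s1), (p2, s2)):
        total -= pj * sum((1 - (1 - beta) ** (T - k - 1)) * (1 - sj ** k)
                          for k in range(1, T - 1))
    return total


def v_repair_only(r1, r2, s1, s2, beta, R, W):
    """Improvement after a repair under pi_r(W), as in (5.8)."""
    r = (r1, r2, 1 - r1 - r2)
    up_time = r1 * (1 - s1 ** W) / (1 - s1) + r2 * (1 - s2 ** W) / (1 - s2)
    return up_time / (1 - (1 - cat_prob(r, s1, s2, beta, W)) * (1 - beta) ** R)
```

A useful consistency check is that $L(\mathbf{p},T) - f(\mathbf{p},T)/\beta$ equals the expected number of periods the unit is up, which is the numerator of (5.8). Maximizing `v_repair_only` over $W$ gives the best repair-only policy for comparison with inspection policies.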



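Under the policy $\pi_i(T_0, T_1, \ldots)$ we get the following equations

$$
\tilde V_r = \frac{r_1(1-s_1^{T_0})}{1-s_1} + \frac{r_2(1-s_2^{T_0})}{1-s_2}
+ \big(r_1 s_1^{T_0} + r_2 s_2^{T_0}\big)(1-\beta)^M \tilde V_{i,T_0}
+ \big(1 - r_1 s_1^{T_0} - r_2 s_2^{T_0} - f(\mathbf{r},T_0)\big)(1-\beta)^R \tilde V_r .
\tag{5.9}
$$

[Equation (5.10), the corresponding relation for the values $\tilde V_{i,\cdot}$
at the inspection decision points, is not legible in the source; its
last term has the form $\big(1 - \cdots - f(\mathbf{i}(\cdot),T_k)\big)(1-\beta)^R \tilde V_r$.]   (5.10)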



It appears somewhat difficult to solve (5.9) and (5.10) as we
have an infinite set of equations. However, we can assume that for
all $T_k > N$, for some $N$, $\tilde V_{i,T_k}$ is approximately constant, since
if a large number of periods have passed since the last repair,
with no intervening failure, it is a good approximation to
assume the unit is in the better of the two up states. This
enables us to solve these equations using the bisection method
reviewed in Thomas [19]. The method depends on the fact that if we
substitute $\tilde V_r = c$ in the R.H.S. of (5.9) and (5.10) we can work
back and solve for $\tilde V_r$ on the L.H.S. of (5.9). If $c$ is the
correct value of $\tilde V_r$, the L.H.S. of (5.9) is $c$. Because $\tilde V_r$ enters
the right-hand sides only through terms multiplied by probabilities and
powers of $(1-\beta)$, the L.H.S. changes by less than $c$ does; so if $c > \tilde V_r$
the L.H.S. of (5.9) will be smaller than $c$, while if $c < \tilde V_r$ it will be
greater than $c$. Using this as
the basis of the bisection method and taking all inspections
more than 50 periods after a repair as the same, we get the
forms of the approximately optimal policies found in Table 6
(the units of time are weeks).
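The bisection step itself can be sketched generically. In the code below, `lhs` stands for the operation "substitute $c$ in the right-hand sides of (5.9) and (5.10) and work back to the left-hand side of (5.9)". The toy map used to exercise it is a hypothetical stand-in with slope less than one, not the model's equations. The sketch relies only on `lhs(c) - c` changing sign at the fixed point, being positive below it and negative above it.

```python
def bisect_fixed_point(lhs, lo, hi, tol=1e-9):
    """Find c with lhs(c) == c, assuming lhs(c) > c below the root and lhs(c) < c above it."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) > mid:
            lo = mid      # trial value too small
        else:
            hi = mid      # trial value too large
    return 0.5 * (lo + hi)


def toy_lhs(c, gain=12.0, weight=0.8):
    """Hypothetical stand-in for working back through (5.9)-(5.10)."""
    return gain + weight * c
```

Here `bisect_fixed_point(toy_lhs, 0.0, 1000.0)` converges to the fixed point `gain / (1 - weight)`. In the model, `lhs` would also carry out the maximization over the inspection times while working back.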
The parameters in the comparable continuous time model of
Section 2 are (in units of weeks): $\nu = 0.1$, [illegible], $\delta_1 = .04$,
$\delta_2 = 0.5$, $M = 0.035$ and $R = 0.07$. The corresponding
best policies under the "short-long" inspection rule of Section
2, with inter-inspection times restricted to being multiples
of a week, are as follows:

Table 7

Case   OKI   OKR   Best Policy        Best expected time to a catastrophic event (weeks)
I      0.9   0.9   1 (4 times), 2     61.09
II     0.9   0.5   1 (2 times), 3     42.19
III    0.5   0.9   3 (1 time), 1      29.37
IV     0.5   0.5   3 (1 time), ∞      19.98



The difference in policies for Case III results from the 
fact that the discrete time model allows a decision of repair 
without inspection. The differences in the policies for cases 
I, II, and IV come about because the continuous time model only 
allows inspection periods of two different lengths whereas 
the optimal policy in the discrete time model goes gradually 
from the length of the inspection period just after a repair 
to an asymptotic inspection period if the inspections are 
successful. However, subject to its restrictions, the policy 
of the continuous time model is comparable to that of the dis- 
crete time model. 
The differences between the best expected times to a
catastrophic event in the two models result from the discretization
of time in Model 8. If the time interval in the discrete
time model of Case I is taken to be 1/10 week instead of 1 week,
with the resulting change of parameters $\beta = .01$, $i = 0.9$,
$r_1 = 0.6$, $r_2 = 0.3$, $s_1 = .996$, $s_2 = .95$, $M = 0.35$, $R = 0.7$,
then the optimal policy is to inspect 7 periods after a repair,
and if up, then 8 periods later, then 9, 11, 13, 16, 18 and
20 periods, and the expected time until catastrophic failure is
626.0 periods. In the original time scale this is a time of
62.6 weeks. Note that the difference between the expected times
to a catastrophic event is now small for the two models. This
suggests that the policy that was proposed in Section 2, while
not optimal, is a good one.
6. CONCLUSIONS

The following conclusions can be drawn about the form of
the optimal policy by studying the models in this paper.

1) If the failure rate of the system increases with age,
then the inspection intervals should decrease with age. Numerical
examples based on Model 1 have borne this out. The model
calculations suggest optimal intervals based on the underlying
parameters.

2) If there is only one state the unit can be in when it
is 'up', and the probability of being up, $i$, is the same after
each inspection and the probability of being up after a repair
is also a constant $r$, then the optimal policy is to have one
'short' inspection interval after a repair, and a 'longer'
inspection interval always thereafter ($i > r$), or else to repair
at fixed intervals with no inspection ($r$ considerably larger
than $i$). The 'longer' inspection interval must always be at least as
long as the 'short' initial inspection interval.

3) The results of 1) and 2) hold whether or not successfully-dealt-with
initiating events are considered as a type of inspection.
However, there are considerable differences in the actual
inspection periods for these two cases.

4) In order for the optimal inspection policy to require
several 'short' inspection intervals followed by longer ones,
it is necessary to assume the unit can be in more than one 'up'
state with different failure rates. In this case there is
not an abrupt jump from 'short' inspection intervals to 'long',
but a gradual increase in the inspection interval. However,
there is a suggestion that a comparable policy, in which there
is a sharp jump between short inspections and long ones, will
give an expected time to a catastrophic event that is close to
that achieved by the optimal policy.
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